This repository contains all the data and notebooks needed to reproduce the findings of the paper
"How populist are parties? Measuring degrees of populism in party manifestos using supervised machine learning" 
by Jessica di Cocco (Sapienza University of Rome) and Bernardo Monechi (Sony CSL Paris).

###REQUIREMENTS##########

This repository has been tested on Manjaro (21.0.2) and Ubuntu (20.0.4). Please contact the authors if you would like it to be tested on other operating systems, or to report that you have tested it on one.

The PC configuration used for the repository testing is:

- CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz (8 cores)
- RAM: 32 GB DDR4
- Graphics: Intel UHD Graphics 630 (first graphic card) and GeForce GTX 1650 Mobile (second graphic card)
- Hard Disk: Intel SSDPEMKF010T8 NVMe 1024GB

Note that the most resource-intensive part of the repository is the notebook 01_train_model.ipynb. We provide a sample of already trained models so that users with fewer resources can skip this step.

Please be sure to use Python 3.8.5 or higher.
The following packages are required to reproduce the results:

matplotlib (3.4.1)
seaborn (0.11.1)
scipy (1.6.2)
numpy (1.20.1)
json (2.0.9)
sklearn (0.23.1)
pandas (1.2.3)

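They can be installed using "pip" (we used 20.3.1): https://pypi.org/project/pip/
Note that json is part of the Python standard library and needs no installation, and that sklearn is installed from the "scikit-learn" package.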
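
To compare a local environment against the versions listed above, a small check such as the following may help. It is only a sketch: the helper name installed_versions is ours and is not part of the repository, and it reports whatever each module exposes as __version__ without enforcing anything.

```python
import importlib

# Import names and the versions this README reports as tested.
REQUIRED = {
    "matplotlib": "3.4.1",
    "seaborn": "0.11.1",
    "scipy": "1.6.2",
    "numpy": "1.20.1",
    "json": "2.0.9",
    "sklearn": "0.23.1",
    "pandas": "1.2.3",
}


def installed_versions(modules):
    """Return {module name: version string}, with None if the import fails."""
    found = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
            continue
        # Not every module defines __version__, so fall back to "unknown".
        found[name] = getattr(module, "__version__", "unknown")
    return found
```

Calling installed_versions(REQUIRED) returns a dictionary that can be compared by eye with REQUIRED. A value of None means the package could not be imported.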

There are several tar.gz files in the repository that contain raw data, pretrained models and figures. Be sure to untar them all
before starting.
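
The archives can be unpacked with "tar", or from Python. The sketch below assumes that all the tar.gz files sit in the main directory of the repository and that each one should be extracted in place. The function name extract_archives is hypothetical.

```python
import tarfile
from pathlib import Path


def extract_archives(repo_root="."):
    """Extract every *.tar.gz file found directly inside repo_root.

    Returns the names of the archives that were extracted.
    """
    root = Path(repo_root)
    extracted = []
    for archive in sorted(root.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            # Extract next to the archive, e.g. datasets.tar.gz -> datasets/
            tar.extractall(path=root)
        extracted.append(archive.name)
    return extracted
```

Running extract_archives(".") from the main directory of the repository unpacks every archive found there.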

####RAW DATA#############

The raw data of political manifestos and leaders' speeches can be found in the "datasets" directory. To create this directory, untar
the datasets.tar.gz file.
All the files are in .json format. Each record represents a sentence and contains the following fields:

- year: the year of the manifesto or the speech the sentence comes from
- party: the party the manifesto or the speaker belongs to
- leader: the leader of the party in that specific year (can be "null" if missing)
- text: the raw text of the sentence
- cleaned_text: a list of stemmed words obtained from the raw text
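
As an illustration of the record layout, the following sketch loads one file and counts the sentences per party. It assumes that each file holds a single JSON array of records. If the files are organized differently (for instance one record per line), the loading step has to be adapted. The function names are ours.

```python
import json
from collections import Counter

# Field names as listed in this README.
FIELDS = ("year", "party", "leader", "text", "cleaned_text")


def load_sentences(path):
    """Load one dataset file, assumed to contain a JSON array of records."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    for record in records:
        missing = [field for field in FIELDS if field not in record]
        if missing:
            raise ValueError(f"record is missing fields: {missing}")
    return records


def sentences_per_party(records):
    """Count how many sentences belong to each party."""
    return Counter(record["party"] for record in records)
```

Note that a "null" leader in the file is read as None by the json module.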

####NOTEBOOKS AND SCRIPTS#############

Run the notebooks in the order given by their numeric prefix (from 00 to 03).

- 00_generate_bag_of_words.ipynb ---- (running time 16s on our machine)

This notebook reads the raw data and uses it to create the bag-of-words (bow) representations and the labels used to train the classifiers.
The bow data and labels are stored as numpy arrays in the "bow_and_lables" directory.
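
For readers unfamiliar with the representation, here is a minimal sketch of how a bag-of-words count matrix can be built from "cleaned_text" lists. It uses plain Python lists and raw counts. The notebook itself stores numpy arrays and may use a different vocabulary or weighting, and the construction of the labels is not shown here. The function names and the example stems are made up.

```python
def build_vocabulary(token_lists):
    """Map each distinct stem to a column index, in alphabetical order."""
    stems = sorted({token for tokens in token_lists for token in tokens})
    return {stem: index for index, stem in enumerate(stems)}


def bag_of_words(token_lists, vocabulary):
    """Return one count vector per sentence, with one column per stem."""
    matrix = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for token in tokens:
            index = vocabulary.get(token)
            if index is not None:  # stems outside the vocabulary are ignored
                row[index] += 1
        matrix.append(row)
    return matrix


# Two made-up sentences, already stemmed as in the "cleaned_text" field.
sentences = [["peopl", "elit", "peopl"], ["tax", "elit"]]
vocabulary = build_vocabulary(sentences)
bow = bag_of_words(sentences, vocabulary)
```

Each row of bow corresponds to a sentence and each column to a stem, so the word order within a sentence is discarded.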

- 01_train_model.ipynb ---- (running time depends heavily on the chosen model and nation. A typical case is Random Forest training on the Italian data, which takes around 1 hour on our machine when the training is parallelized over 8 cores.)

This notebook uses the bow data and labels to train the classifiers used to compute the Populist Score.
You can choose the country for which the classifier will be trained, together with the kind of model and
the target score to be optimized.

Once these are chosen, the notebook performs a grid search over a set of predefined parameters for the given model and
returns the best one according to the selected target score.
It then checks the performance of the model on a test set that has been left out of the training.
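
The idea of a grid search can be sketched in a few lines of plain Python. This is an illustration of the technique only, not the code used in the notebook. The parameter names and the toy scoring function are hypothetical stand-ins for "train the model with these parameters and compute the target score on validation data". scikit-learn provides the same functionality, with cross-validation, through GridSearchCV.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one.

    param_grid: dict mapping a parameter name to a list of candidate values.
    score_fn:   callable taking a parameter dict and returning a score,
                where higher is better.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up score that peaks at n_estimators=200 and max_depth=8."""
    return -abs(params["n_estimators"] - 200) - 10 * abs(params["max_depth"] - 8)


grid = {"n_estimators": [100, 200, 400], "max_depth": [4, 8, 16]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the number of candidate values per parameter (here 3 x 3 = 9 evaluations), which is why the training step is the most time-consuming part of the pipeline.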

Please be sure to run this notebook at least once per kind of dataset in the datasets folder, to be able to reproduce all the results.

The models are saved into the "models" folder, together with other information (e.g. the indexes for the test data).

A summary of the model is saved into the "training_results.json" file.

NOTE: we already provide a set of trained models, since training is time-consuming. You can skip this notebook if you do not want to train
different models. Use the next script if you want to retrain all the models used in the paper.

- 01_train_all_models.py --- (running time around 1-2 days on our machine)

This script automates the training of the single classifiers in 01_train_model.ipynb, iterating over a predefined set of countries and model types.
It is more convenient to use this script if you wish to retrain all the models used to reproduce the findings of the paper, or to train many models yourself.

This script produces the same outputs as 01_train_model.ipynb and writes them to the same locations.

Simply launch "python 01_train_all_models.py" in the main directory of this repository.


- 02_compute_scores.ipynb ---- (running time around 110s on our machine)

This notebook uses the models trained in the previous step to compute the Populist Score for each country in each year.
Be sure to have at least one trained model per nation and to set the parameters used correctly in the configuration cell at the beginning of 
the notebook (you can check "training_results.json" to find possible configurations to use).

Results are saved into the "scores" folder in .csv format. "global_scores_" files contain the Populist Score computed for each party of the selected country.
"scores_in_time_" files contain the scores broken down by year.

- 03_results.ipynb ---- (running time 10s on our machine)

This notebook reproduces all the results of the paper. Be sure to have followed all the previous steps correctly.
If additional external data is read, it will be indicated in the corresponding section of the notebook.